Abstract

Cosmological simulations of structure and galaxy formation have played a
fundamental role in the study of the origin, formation and evolution of the
Universe. These studies improved enormously with the use of supercomputers and
parallel systems and, more recently, grid-based systems and Linux clusters. We
present the new version of the parallel tree N-body code FLY, which runs on a PC
Linux cluster using the MPI-2 one-sided communication paradigm, and we report
the performance obtained. FLY is included in the Computer Physics
Communications Program Library. This new version was developed on the Linux
cluster of CINECA, an IBM cluster with 1024 Intel Xeon Pentium IV 3.0 GHz
processors. The results show that a 64 Million particle simulation can be run
in less than 15 minutes per timestep, and that the code scales with the number
of processors. This leads us to propose FLY as a code for very large N-body
simulations, with more than 10^9 particles, at the higher resolution of a pure
tree code. The new version of FLY will be available at
http://www.ct.astro.it/fly/ and in the CPC Program Library.
1 Introduction



In cosmological applications, N-body codes are used to study the formation and
evolution of structure throughout the history of the Universe. In a simulation,
particles represent such large aggregates that galaxies are typically just
resolved. Simulations must follow a box large enough to accurately represent
the power spectrum of fluctuations on very large scales, so that the results
can be compared with real data. The number of particles then sets the mass
resolution of the simulation, which we would like to make as fine as possible.
This requires very large values of N: state-of-the-art simulations follow up to
10 billion particles (17) ffioh, and some codes evolve both N-body particles and
gas. FLY is a parallel tree code that runs very high resolution N-body
simulations of the Large Scale Structure of the Universe, and it could be
integrated with a code that computes the hydrodynamic evolution of the system
using a Paramesh structure.

Among the most widely adopted N-body codes for cosmological simulations is
the Gadget code (16). Gadget is a TreeSPH (11) code for the simulation of
cosmological regions, which follows both collisionless matter (dark matter)
and an ideal gas. The gravitational interactions are computed using a tree
algorithm (a hierarchical multipole expansion), while the gas dynamics uses a
smoothed particle hydrodynamics (SPH) scheme. Both gas and dark matter are
represented by particles. In the new version of Gadget (GADGET-2) (15),
gravitational forces are computed with a hierarchical multipole expansion,
which can optionally be applied in the form of a TreePM algorithm, where only
short-range forces are computed with the tree method while long-range forces
are determined with Fourier techniques.

Another widely used code in cosmology is the Enzo code (@). Enzo uses a
totally different approach to collisionless systems. It runs hydrodynamic and
N-body simulations using the adaptive mesh refinement technique (0). The dark
matter particles are sampled onto a grid structure to form a spatially
discretized field, and Poisson's equation for the dark matter evolution is
solved with the FFT method. The hydrodynamic part is solved with a modified
version of the piecewise parabolic method (PPM) (18).

Finally, we cite the PMFAST code (14), based on the MPI and OpenMP paradigms.
The forces are divided into short-range and long-range components. The
short-range components are computed on a fine mesh and the long-range
components on a coarse mesh (four times coarser than the fine mesh); both use
the FFT algorithm. In this scenario, FLY is a free parallel tree N-body code
that allows researchers to run cosmological simulations with higher resolution
and a very high number of particles.
Simulations with more than 10 s can be easily done even using a small cluster. 
In the following sections we describe the main features of the code and the
performance obtained on a Linux cluster system.



2 FLY code description 



FLY is a parallel tree N-body code for cosmological simulations of the Large
Scale Structure of the Universe, based on the Barnes-Hut algorithm (3). The
code is written in Fortran 90 and C, and it is based on the one-sided
communication paradigm. FLY creates an MPI window object for one-sided
communication for each of the shared arrays, which can then be accessed by all
the processes, avoiding any kind of synchronization. Version 1.0 of the code
was originally developed on CRAY T3E and SGI ORIGIN systems using the logically
SHared MEMory access routines (SHMEM). FLY version 2.1 was implemented for the
IBM SP using the Low-Level Application Programming Interface routines (LAPI).

This new code (version 3.1) is a stable version that runs on Linux platforms,
from a single PC to a Linux cluster. It is the evolution of the earlier
versions, and it reaches very high performance on all the systems where it has
been tested. The main goal is to provide researchers with a powerful code for
cosmological simulations with higher resolution than other public domain
codes.

The new version of FLY is implemented using the MPI-2 standard. The first
release was implemented on the IBM SP system, but a stable version was then
written using the MPICH2 library on a PC Linux cluster, obtaining very good
results (the FLY performance is reported in section 4). MPICH2 is an MPI
implementation designed to support the MPI-2 additions: dynamic process
management, one-sided operations, parallel I/O and other extensions. It also
provides a vehicle for MPI implementation research and for the development of
new and better parallel programming environments. MPICH2 has a set of daemons
(called mpd's) that verify the communication among machines before running
parallel processes, and it implements a portable mpiexec command to start
parallel applications.

2.1 Equations of motion

A detailed discussion on the discretized equations of motion used in FLY 
can be found in the reference guide ([]]) paragraph 2. Here we report a short 
summary of the equations. 

The Friedmann-Robertson-Walker metric is characterized by an expansion
factor $a(t)$, where $t$ is the conformal time. Let $\mathbf{x}_i(t)$ be the
comoving coordinate of the i-th particle and $m_i$ its mass; then the equations
of motion are given by:

$$\dot{\mathbf{x}}_i = \mathbf{v}_i \qquad (1)$$

$$\dot{\mathbf{v}}_i + 2\,\frac{\dot{a}}{a}\,\mathbf{v}_i = -\frac{G}{a^3} \sum_{j \neq i} m_j\,\frac{\mathbf{x}_i - \mathbf{x}_j}{|\mathbf{x}_i - \mathbf{x}_j|^3} + \mathbf{F}_{\mathrm{Ewald}}(\mathbf{x}_i) \qquad (2)$$

where $G$ is the gravitational constant, the term
$\mathbf{F}_{\mathrm{Ewald}}(\mathbf{x}_i)$ represents the Ewald correction,
which takes into account the contribution to the force from the periodic
boundary conditions, and the dot denotes the derivative with respect to the
conformal time $t$. We also define the Hubble parameter $H(t) = \dot{a}/a$.
It is more convenient to introduce a set of dimensionless variables: 

$$\mathbf{x}_i = L\,\mathbf{x}'_i, \qquad t = T\,\tau, \qquad m_i = M\,m'_i$$



In terms of these variables, the dimensionless equations of motion become: 
dx' 



A detailed discussion of the units of measure, the choice of the time
variable, the adopted gravitational potential (in the Plummer form), the
choice of the dynamic time-stepping criterion and the discretized equations
used by the FLY code can be found in ([]]) paragraph 2.
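
The rescaling to dimensionless variables can be illustrated with a short derivation. It is only a sketch: it assumes that the dimensionless (primed) variables are defined through a length unit $L$, a time unit $T$ and a mass unit $M$ as $\mathbf{x}_i = L\,\mathbf{x}'_i$, $t = T\,\tau$, $m_i = M\,m'_i$, which may differ from the convention of the reference guide, and it omits the Ewald term for brevity. With $\mathbf{v}_i = (L/T)\,\mathbf{v}'_i$, substituting into equations (1) and (2) and multiplying by $T^2/L$ gives:

```latex
\frac{d\mathbf{x}'_i}{d\tau} = \mathbf{v}'_i ,
\qquad
\frac{d\mathbf{v}'_i}{d\tau} + \frac{2}{a}\frac{da}{d\tau}\,\mathbf{v}'_i
  = -\frac{G M T^2}{L^3}\,\frac{1}{a^3}
    \sum_{j \neq i} m'_j\,
    \frac{\mathbf{x}'_i - \mathbf{x}'_j}{|\mathbf{x}'_i - \mathbf{x}'_j|^3}
```

Under these assumptions, choosing the units so that $G M T^2 / L^3 = 1$ removes the gravitational constant from the equations that the code integrates.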



2.2 FLY technical features



The domain decomposition criterion is a fundamental point of all the cosmo- 
logical codes. The data distribution criterion must balance the load among the 
processors and minimize the communication on the network. The main data 
structures, particles and tree cells, are statically divided among the processes 
to ensure a good initial balance of the load and to avoid any bottleneck while 
accessing remote data. 

FLY assigns an equal number of particles to each process: running with N_p
processes, each process holds an equal portion of the particle arrays (e.g.
pos(1:3, N_part/N_p)). Building on this assignment, we developed a dynamic
load balance procedure that exploits the characteristics of one-sided
communication, data grouping and data buffering.

Using the FLY sort utility ((5) paragraph 3), a sorted input file is obtained.
It contains the position and velocity fields, ordered so that particles with
nearby tag numbers are also close in physical space and are stored in the same
process. Since the direct interaction among the nearest particles has a
relevant weight in the force computation, this distribution minimizes the
communication and maximizes the local computation.
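
The effect of such a sort can be illustrated with a space-filling-curve ordering. The text does not say which ordering the FLY sort utility uses, so the Morton (Z-order) key below is an assumption chosen only because it is a common way to make nearby tag numbers correspond to nearby positions; the function names, the `box` size and the `bits` resolution are hypothetical.

```python
def morton_key(ix, iy, iz, bits=10):
    """Interleave the bits of three integer grid coordinates (Z-order key)."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (3 * b)
        key |= ((iy >> b) & 1) << (3 * b + 1)
        key |= ((iz >> b) & 1) << (3 * b + 2)
    return key


def sort_by_locality(positions, box=1.0, bits=10):
    """Return the positions sorted so that neighbours in the list tend to be
    neighbours in space. Positions are assumed to lie in [0, box)."""
    n = 1 << bits  # grid cells per axis

    def key(p):
        cells = [min(int(c / box * n), n - 1) for c in p]
        return morton_key(cells[0], cells[1], cells[2], bits)

    return sorted(positions, key=key)


points = [(0.9, 0.9, 0.9), (0.1, 0.1, 0.1), (0.12, 0.1, 0.1)]
ordered = sort_by_locality(points)
```

After the sort, the two particles near the origin are adjacent in the list, so a contiguous block assignment would place them on the same process.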

The tree cells are numbered progressively from the root down to the smallest
cells. The optimal data distribution is reached by using a fine grain data
distribution; more details are reported in section 2.4.

Another important feature is the grouping of the force calculation. The basic
idea is to build a single interaction list to be applied to all the particles
inside a grouping cell C_group of the tree. This reduces the number of tree
accesses, because the list of elements used to compute the force is built once
for the whole grouping cell rather than once for each particle (0).

The last important feature is the data buffering. The data buffer uses all the
free memory, not allocated to other arrays, to store remote particles and tree
cell properties. The buffer is managed with the policy commonly used for a
cache memory: every time a process accesses a remote element, it first looks
for it in the local data buffer; if the element is not found, the process
executes the GET calls to download the remote element and stores it in the
local data buffer.
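
The lookup policy of the data buffer can be sketched in a few lines. This is a serial illustration, not FLY's implementation: the `fetch` callback stands in for the remote GET, and the least-recently-used eviction is an assumption, since the text only states that the buffer is managed like a cache memory.

```python
from collections import OrderedDict


class RemoteBuffer:
    """Local buffer for remote elements, looked up before any remote access."""

    def __init__(self, fetch, capacity):
        self.fetch = fetch            # stand-in for a remote GET (hypothetical)
        self.capacity = capacity      # number of elements the free memory can hold
        self.store = OrderedDict()    # index -> value, oldest entry first
        self.remote_gets = 0

    def get(self, index):
        if index in self.store:       # hit: no communication needed
            self.store.move_to_end(index)
            return self.store[index]
        value = self.fetch(index)     # miss: download the remote element
        self.remote_gets += 1
        self.store[index] = value
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict the least recently used entry
        return value


buf = RemoteBuffer(fetch=lambda i: i * i, capacity=2)
values = [buf.get(3), buf.get(3), buf.get(4), buf.get(5), buf.get(3)]
```

In this usage the second access to element 3 is served locally, while the last one triggers a new remote access because the element was evicted in the meantime.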



2.3 Dynamic Load Balance 

Load balancing and high performance are achieved by using the grouping
feature mentioned above and the one-sided communication scheme described in
(Q), which is briefly summarized here.

Each particle or grouping cell has an executor processor (hereafter PEx) that
computes the force for the particles. The PEx for a group is the processor
where most of the particles of the group are stored. Likewise, the PEx for a
particle is the processor where the particle properties are stored. First,
each PEx computes the force for all the particles in its grouping cells. When
a processor has no more groups to compute, it can start the force computation
for other C_group cells not yet computed by their default PEx. FLY balances
the load for the remaining particles in a similar way: when a processor has no
more local particles to compute, it can start this phase for other particles
not yet computed by their default PEx.

The one-sided communication paradigm allows FLY to perform this task without
synchronization or waiting states among the PEs, and to obtain a high load
balance in this phase.

In this way each processor starts working on its local groups and particles,
then continues with remote groups and particles, and stops only when no more
particles in the simulation need to be computed, thus avoiding load imbalance.
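
The balancing scheme described above can be mimicked with a small serial simulation. It is a sketch under simplifying assumptions: each work item (a group or a particle) has a known cost, an idle PE takes work from the PE with the most items left, and claiming an item is free. The function name and the donor-selection rule are hypothetical and do not reproduce FLY's counters.

```python
import heapq
from collections import deque


def simulate_dynamic_balance(costs_per_pe):
    """costs_per_pe[p] lists the costs of the items whose default executor is PE p.
    Returns the time at which each PE runs out of work."""
    queues = [deque(costs) for costs in costs_per_pe]
    clock = [(0.0, pe) for pe in range(len(queues))]   # (time when free, PE)
    heapq.heapify(clock)
    finish = [0.0] * len(queues)
    while clock:
        now, pe = heapq.heappop(clock)
        if queues[pe]:
            cost = queues[pe].popleft()                # local work first
        else:
            donor = max(range(len(queues)), key=lambda q: len(queues[q]))
            if not queues[donor]:
                finish[pe] = now                       # nothing left anywhere
                continue
            cost = queues[donor].pop()                 # item its owner has not reached
        heapq.heappush(clock, (now + cost, pe))
    return finish


finish = simulate_dynamic_balance([[1.0] * 8, [1.0] * 2, [1.0] * 2])
```

With 8, 2 and 2 unit-cost items on three PEs, every PE finishes at time 4, whereas without balancing the first PE alone would need 8 time units.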



2.4 Data distribution 

The data distribution is described in (El) and ((3), and is briefly summarized
here.

The tree cells are numbered progressively from the root down to the smallest
cells. A good data distribution scheme is reached by using a fine grain data
distribution, as reported in the articles mentioned above. The first tree
levels contain cells that are always checked to form the list of cells and
particles needed to compute the force on a given body.

This data distribution prevents all the processors from looking up the same
cells in the same remote memory, and avoids the typical problems of access to
a critical resource: with a fine grain distribution of the tree, the memories
of all the PEs are, on average, accessed with the same frequency. Each
particle thus has the same average access time to the tree cells, and the
bottleneck is avoided.

Particle properties are organized with the following scheme. Each processor
holds the same number of particles, Nbodies/N_processors, close to each other
in space, thanks to the sort utility described in ((HI). This distribution in
contiguous blocks performs well in terms of measured code performance: the
list of particles needed to compute the forces often includes nearby particles
that are locally stored, so communication is minimized.
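
The two placement rules of this section can be contrasted in a short sketch. It assumes that "fine grain" means a cyclic assignment of consecutive cell numbers to consecutive PEs, that indices start from 0, and that the number of particles is a multiple of the number of PEs; the function names are hypothetical.

```python
from collections import Counter


def cell_owner(cell, n_pe):
    """Fine grain (cyclic) placement: consecutive cell numbers go to consecutive PEs."""
    return cell % n_pe


def particle_owner(i, n_bodies, n_pe):
    """Contiguous blocks: each PE holds n_bodies // n_pe consecutive particles."""
    return i // (n_bodies // n_pe)


n_pe = 8
top_cells = range(1 + 8 + 64)   # root plus the first two levels of an octree
load = Counter(cell_owner(c, n_pe) for c in top_cells)
```

In this example the 73 cells of the first tree levels, which every force computation visits, are spread over all 8 PEs with at most one cell of difference, while sorted particles stay together in blocks.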



3 The MPI-2 code version 



FLY is based on the one-sided communication paradigm. The new version adopts
the MPICH2 library. The main data structures that are accessed by remote
processes are declared in a module of FLY (the fly_h.F90 file). FLY then
creates an MPI window object for one-sided communication for each of the
shared arrays, with a call like the following:

! Expose the local portion of the pos array to all the processes
CALL MPI_WIN_CREATE(pos, size, real8, MPI_INFO_NULL, &
                    MPI_COMM_WORLD, win_pos, ierr)

The following main window objects are created:

• win_pos, win_vel, win_acc: particle positions, velocities and accelerations

• win_pos_cell, win_mass_cell, win_quad, win_subp, win_grouping: cell
positions, masses, quadrupole moments, tree structure and grouping cells.

Other windows are created for the dynamic load balance and the global
counters. Communication occurs mainly in two phases: the tree construction
and the force computation.

During the tree construction phase, all the processes cooperate to build the
single tree structure of the simulation. In the tree_gen.F90 routine, each
process mainly computes the cells that are locally resident. The tree is built
level by level, with one cycle for each level. At every level the cells are
subdivided into 8 sub-cells, which are prepared for the next level. The
processes compute the number of particles in each sub-cell: sub-cells with
more than one particle form the cells of the new level. During this phase FLY
must access remote cells. Thanks to the data buffering, the window locking
calls are always of shared type, and a put operation like the following is
often required:

! Store one entry of the tree structure array on the remote process ind_pe_rmt
CALL MPI_WIN_LOCK(MPI_LOCK_SHARED, ind_pe_rmt, 0, &
                  win_subp, ierror)
CALL MPI_PUT(subp_ch(K, J), 1, MPI_INTEGER4, ind_pe_rmt, &
             startIndex, 1, MPI_INTEGER4, win_subp, ierror)
CALL MPI_WIN_UNLOCK(ind_pe_rmt, win_subp, ierror)

Sometimes, depending on the size of the data buffer, an accumulate operation
like the following can occur:

! Add a local contribution to an entry stored on the remote process ind_pe_rmt
CALL MPI_WIN_LOCK(MPI_LOCK_SHARED, ind_pe_rmt, 0, &
                  win_subp, ierror)
CALL MPI_ACCUMULATE(subp_ch(K, ind_ch), 1, MPI_INTEGER4, &
                    ind_pe_rmt, startIndex, 1, MPI_INTEGER4, &
                    MPI_SUM, win_subp, ierror)
CALL MPI_WIN_UNLOCK(ind_pe_rmt, win_subp, ierror)
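
The level-by-level construction described in this section can be sketched serially, without any MPI call. This is an illustration of the subdivision rule only (a sub-cell becomes a cell of the next level when it holds more than one particle); the function name, the tuple-of-octants cell label and the `max_level` cap are hypothetical and are not taken from tree_gen.F90.

```python
from collections import defaultdict


def build_levels(positions, box=1.0, max_level=10):
    """Return one dict per tree level, mapping each cell to its particle indices.
    A cell is labelled by the sequence of octants leading to it from the root.
    Positions are assumed to lie in [0, box)."""
    levels = []
    current = {(): list(range(len(positions)))}    # the root holds every particle
    for level in range(1, max_level + 1):
        size = box / (1 << level)                  # edge of a sub-cell at this level
        children = defaultdict(list)
        for cell, members in current.items():
            for i in members:
                # which half of the parent cell the particle falls in, per axis
                octant = tuple(int(c // size) & 1 for c in positions[i])
                children[cell + (octant,)].append(i)
        # sub-cells with more than one particle form the cells of the new level
        current = {c: m for c, m in children.items() if len(m) > 1}
        if not current:
            break
        levels.append(current)
    return levels


levels = build_levels([(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (0.9, 0.9, 0.9)])
```

In the parallel code the particle counts of a sub-cell come from several processes, which is where the put and accumulate operations shown above are needed.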

During the force computation phase of the Barnes-Hut algorithm, FLY mainly 
uses the get procedure: generally speaking, it reads remote bodies and cell 
properties in a tree walk procedure, and computes the force for locally residing 
bodies. 

! Read position, mass and quadrupole terms of a remote element.
! Only the win_pos epoch is shown: under the MPI-2 rules each window that is
! accessed needs its own lock/unlock pair.
CALL MPI_WIN_LOCK(MPI_LOCK_SHARED, ind_pe_rmt, 0, &
                  win_pos, ierror)
CALL MPI_GET(pos_cell(1), ndim, MPI_REAL8, ind_pe_rmt, &
             startIndex, ndim, MPI_REAL8, win_pos, ierror)
CALL MPI_GET(pmass(nterms), 1, MPI_REAL8, ind_pe_rmt, &
             startIndex, 1, MPI_REAL8, win_mass_cell, ierror)
CALL MPI_GET(pquad(1, nterms), 5, MPI_REAL8, ind_pe_rmt, &
             startIndex, 5, MPI_REAL8, win_quad, ierror)
CALL MPI_WIN_UNLOCK(ind_pe_rmt, win_pos, ierror)

FLY verifies which remote elements (cells or bodies) must be considered to
compute the force on a given particle, and gets the remote data. Local data
are accessed without MPI communication. All the accesses in this phase are
carried out with a LOCK_SHARED access. For the dynamic load balance and the
grouping computation there are a few LOCK_EXCLUSIVE calls in critical
sections, mainly to update global counters in the acc_comp.F90 routine.
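
The tree walk of the force computation can be illustrated with a minimal serial Barnes-Hut sketch. It is not FLY's routine: it keeps only the monopole term (FLY also stores quadrupole moments), uses the standard opening criterion size/distance < theta, applies a Plummer softening `eps`, assumes positive masses and distinct positions, and all names and default values are hypothetical.

```python
class Cell:
    """Cubic tree cell holding the total mass and centre of mass of its content."""

    def __init__(self, center, size):
        self.center, self.size = center, size
        self.mass, self.com = 0.0, [0.0, 0.0, 0.0]
        self.children = {}   # octant number -> Cell
        self.body = None     # (position, mass) while the cell holds a single body


def insert(cell, pos, m):
    if cell.mass == 0.0:
        cell.body = (pos, m)                  # empty cell: store the body here
    else:
        if cell.body is not None:             # occupied leaf: push the old body down
            old, cell.body = cell.body, None
            _descend(cell, *old)
        _descend(cell, pos, m)
    total = cell.mass + m
    cell.com = [(cell.com[k] * cell.mass + pos[k] * m) / total for k in range(3)]
    cell.mass = total


def _descend(cell, pos, m):
    octant = sum((pos[k] >= cell.center[k]) << k for k in range(3))
    if octant not in cell.children:
        half = cell.size / 2
        center = [cell.center[k] + (half / 2 if pos[k] >= cell.center[k] else -half / 2)
                  for k in range(3)]
        cell.children[octant] = Cell(center, half)
    insert(cell.children[octant], pos, m)


def acceleration(root, pos, theta=0.5, eps=1e-3, G=1.0):
    acc, stack = [0.0, 0.0, 0.0], [root]
    while stack:
        c = stack.pop()
        d = [c.com[k] - pos[k] for k in range(3)]
        r2 = sum(x * x for x in d)
        if c.body is not None or c.size ** 2 < theta ** 2 * r2:
            if r2 == 0.0:
                continue                      # the particle itself
            f = G * c.mass / (r2 + eps * eps) ** 1.5
            for k in range(3):
                acc[k] += f * d[k]            # far cell or single body: use monopole
        else:
            stack.extend(c.children.values())  # cell too close: open it
    return acc


root = Cell([0.5, 0.5, 0.5], 1.0)
insert(root, (0.25, 0.5, 0.5), 1.0)
insert(root, (0.75, 0.5, 0.5), 1.0)
acc = acceleration(root, (0.25, 0.5, 0.5), eps=0.0)
```

In the parallel code, each cell or body opened during this walk may reside on another process, which is where the GET operations shown above occur.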

At the end each process updates the particle positions and a new timestep is
started. In total the code contains more than 570 MPI calls. More than 110 of
them are MPI_WIN_LOCK and MPI_WIN_UNLOCK calls in shared mode, used to perform
get and put operations, plus a few FENCE calls. All the GET operations occur
during a phase in which the data in the windows are not updated, and the PUT
operations occur mainly in physical locations that are accessed by any
process. Only a few exclusive locks are required. About 60 MPI_GET and MPI_PUT
operations are required, and a few global counters are implemented using
MPI_ACCUMULATE calls.



4 FLY performance 



The FLY MPI-2 version was developed on the IBM Linux cluster at CINECA, which
has the following features. Architecture: IBM Linux Cluster 1350, 512 nodes
with 2 processors per node and 2 GB RAM per processor. Processor type: Intel
Xeon Pentium IV 3.0 GHz with 512 KB cache (128 nodes have Nocona processors).
Internal network: Myricom LAN Card "C" Version and "D" Version. Operating
system: Linux SuSE SLES 8.

The code was compiled using the mpif90 compiler version 8.1 with basic
optimization options, so that the measured performance can be usefully
compared with other generic clusters. We used the following compilation
options:

mpif90 -O3 -tpp7 -static -xN

where -O3 enables aggressive optimizations, -tpp7 optimizes the code for the
Pentium IV processor, -static avoids linking with shared libraries, and -xN
generates specialized code that runs exclusively on Intel Pentium IV and
compatible Intel processors. MPICH2 1.0 was installed and used for these
tests, together with the native communication protocol.

We ran FLY using a cosmological CDM+Λ model (Ω = 0.3, Λ = 0.7, h = 0.6) with
different numbers of particles, in order to test the scalability of the code
on the Intel cluster and the scalability of the system. The following sections
report the results obtained.



4.1 Scalability



In this section we report scalability data using two test cases, with 2
Million and 16 Million particles, in a uniform initial condition (z = 80).
Although a timestep could become up to twice as slow when clusters form, the
grouping feature of FLY avoids this behaviour: the duration of a single
timestep does not increase, or increases by no more than 20%. Initial data
were generated using a tool based on COSMICS (0), a package for computing
transfer functions and microwave background anisotropy that also generates
Gaussian random initial conditions for nonlinear structure formation
simulations. We modified the original package to obtain an output format that
FLY can read directly. We generated a normalized matter power spectrum
distribution and constrained random density fields on a lattice, using the
Hoffman-Ribak algorithm (12).

Fig. 1. Timing of FLY phases. Data represent the logarithm of the time of each
FLY phase, normalized to the total time obtained using 48 processors

Table 1
Elapsed time (in seconds) running 2 Million particles

Processors     Tree    Dynamic      Total
         1     5.50     494.52     512.45
         2    80.80     686.79     774.53
         4    83.90     348.36     444.10
         8    48.19     160.91     213.49
        12    37.92      97.93     144.17
        16    34.37      55.36      94.12
        24    19.56      33.85      55.26
        32    19.13      30.81      52.74
        48    12.38      23.65      41.45
        64    12.00      33.06      50.72

Fig. 2. Elapsed time of a single timestep with 64 Million particles

Fig. 1 shows a graphical representation of the code scalability with 2 Million
particles as the number of processors increases. Data are normalized to the
total timestep duration using 48 processors (41.5 seconds) and the logarithm
of the value is plotted. Table 1 displays the elapsed time in seconds of each
main phase of the FLY run. A single timestep of FLY consists mainly of the
tree construction phase and the dynamic evolution of the particles. We report
the tree and the dynamic phases together with the total time of a single
timestep. This result shows that, in this case, it is not very useful to run
this simulation with more than 48 processors.
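
The saturation beyond 48 processors can be read directly from the total times of Table 1. The short calculation below uses only the rows whose processor count is printed in the table and takes the 4-processor run as the reference; the choice of reference and the function names are ours.

```python
# Total elapsed time (seconds) of one timestep, 2 Million particles, from Table 1
TOTAL_TIME = {4: 444.10, 8: 213.49, 12: 144.17, 16: 94.12,
              24: 55.26, 32: 52.74, 48: 41.45, 64: 50.72}


def relative_speedup(times, base):
    """Speedup of each run with respect to the run on `base` processors."""
    return {p: times[base] / t for p, t in times.items()}


def efficiency(times, base):
    """Speedup divided by the increase in processor count with respect to `base`."""
    return {p: s * base / p for p, s in relative_speedup(times, base).items()}


speedup = relative_speedup(TOTAL_TIME, base=4)
eff = efficiency(TOTAL_TIME, base=4)
best = min(TOTAL_TIME, key=TOTAL_TIME.get)   # processor count with the shortest timestep

for p in sorted(TOTAL_TIME):
    print(f"{p:3d} PEs  speedup {speedup[p]:5.2f}  efficiency {eff[p]:4.2f}")
```

With these numbers the shortest timestep is obtained on 48 processors, and going to 64 processors lowers both the speedup and the efficiency.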

The behaviour of the system differs markedly between a run with one processor
and a run with two or more processors: the tree phase increases by one order
of magnitude. In the serial run the tree resides in the local memory of the
processor that runs the application. In a parallel run the time of the tree
construction phase depends mainly on the atomic operations performed to build
the tree, which is shared among the processors. All the processors cooperate
to build the tree, shared in the global memory, and they must manage shared
counters to perform this task (1). It is not possible to run FLY with two or
more processors without building the tree in parallel: a parallel run can take
place only if the tree is shared among the memories of the processors. Overall
the code scales well, in particular in the dynamic phase, while the tree part
accounts for only 10% to 35% of the timestep duration.

Fig. 2 shows a graphical representation of the code scalability with 64
Million particles as the number of processors increases. Data are normalized
to the total timestep duration using 64 processors (822.64 seconds). Table 2
displays the elapsed time, in seconds, of each main phase for this case. The
results of this run show that the scalability for large LSS simulations on
this system architecture is very good: they allow us to obtain the same
performance as on typical MPP systems. Fig. 2 shows the results starting from
16 processors because it is not possible to execute this simulation with less
than 32 GB of RAM.

Table 2
Elapsed time running 64 Million particles

Fig. 3. Number of particles computed each second increasing the size of a
simulation

Fig. 3 displays the global results in terms of code scalability for this kind
of architecture. The results of the simulation running with 2 Million
particles show that with more than 48 processors the system reaches a
performance saturation, whereas good agreement and good scalability are
reached running with more than 48 Million particles. The user who wants to run
FLY on similar systems should consider the above results as a reference case.

All the data shown in the previous figures were measured at the beginning of a
simulation (redshift 80 in our case), when the particles are in a uniform
distribution. The grouping mode of FLY allows the user to set a grouping
factor such that the error is much lower than that of the tree scheme and can
be neglected. In this case the elapsed time for each timestep does not
increase. At the beginning of the evolution there are no clusters and the
elapsed time for each timestep is roughly constant. When clusters start to
form in the simulation, FLY groups also start to form, as mentioned in section
2.2, and this reduces the timestep duration. More details can be found in (4).

Fig. 4. Elapsed time versus redshift of a 2 Million particle simulation

Fig. 4 reports data of a simulation of 2 Million particles without FLY groups
and with a grouping factor of "level 7", with no more than 16 particles in
each grouping cell. In this case at z = 10 only 164367 particles were grouped
and the grouping effect was negligible, but at redshift z = 0 more than 1.2
Million particles were grouped. A similar behaviour is obtained when
increasing the number of processors and/or the number of particles of a
simulation.
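
The quantity plotted in Fig. 3, the number of particles computed each second, can be recomputed from the timestep durations quoted in this section. The sketch assumes that "2 Million" and "64 Million" mean exactly 2,000,000 and 64,000,000 particles, which the text does not state; the function name is hypothetical.

```python
def particles_per_second(n_particles, timestep_seconds):
    """Throughput of one timestep: particles advanced per second of elapsed time."""
    return n_particles / timestep_seconds


# (nominal particle count, processors, seconds per timestep) quoted in section 4.1
small = particles_per_second(2_000_000, 41.45)     # 48 processors
large = particles_per_second(64_000_000, 822.64)   # 64 processors

print(f" 2M particles on 48 PEs: {small:8.0f} particles/s, {small / 48:6.0f} per PE")
print(f"64M particles on 64 PEs: {large:8.0f} particles/s, {large / 64:6.0f} per PE")
```

Under this assumption the larger run processes more particles per second, both in total and per processor, which is consistent with the better scalability reported for large simulations.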



4.2 Isogranularity



A further test allows us to make some considerations on the performance of
this kind of architecture for the FLY code. We measured that a run with 16
Million particles on 16 processors, each with 2 GB of RAM, produces about
1.6 × 10^10 remote operations, mainly data GET and data PUT. FLY uses the data
buffer as described in p), storing remote data in a local buffer that is
managed like a cache memory.

Fig. 5 reports the behaviour of the system when the size of the simulation
increases together with the number of processors and the global RAM. The base
case is a run with 1 Million particles per processor, each with 2 GB of RAM.
We tested the parallel case only, so the curve in the figure starts from 2
Million particles running on two processors. The total timestep and the
dynamical part of the code show that using two processors in the same node
does not give the best performance, due to the intra-node contention for the
access to critical resources. With more than two processors the elapsed time
increases with the number of processors: the total elapsed time with 64
processors is twice that with 4 processors. This behaviour depends on the
network contention of the system, in

Fig. 5. Elapsed time increasing size of a simulation



5 Testcase Cosmological simulation 

This section gives a short description of a simple FLY run, with the aim of
providing a test case that the user can execute. FLY is included in the CPC
Program Library. The test case we describe can be downloaded from the FLY
page at http://www.ct.astro.it/fly/

We ran a 2 Million particle simulation. The initial conditions were created
using COSMICS (0) in a 50 h^-1 Mpc cubic region, within a CDM+Λ cosmogony
corresponding to Ω = 0.3, Ω_lambda = 0.7, z_start = 80 and h = 0.6. FLY
implements a set of cosmological equations of motion, solving the standard
particle equations of motion for a Friedmann cosmology, with the Ewald
correction, which takes into account the contribution to the force from the
infinite replicas of the simulation box along the spatial directions. All the
parameters reported in the test case are discussed in (1).

The user must download the FLY source code and compile it using the parameter
nknots = 49 in the fly_h.F90 module file, together with the files provided at
the test case link. We executed the simulation on the CINECA Linux cluster
described above, with 32 processors and the native communication protocol.
The simulation evolved without grouping for 66 timesteps in a total time of
about 100 minutes.
6 Conclusions



This new free release of FLY (version 3.1) contributes two main new features
to the astrophysical community, which can open new opportunities and lead to
new results in the cosmological field.

It is now possible to execute LSS cosmological simulations with high
resolution using FLY, a tree N-body code, on a Linux cluster with the MPICH2
library. Moreover, FLY has a new interface that can exchange data using a
Paramesh-like structure, giving researchers new possibilities: it is possible
to run two separate codes, both with high resolution, on the same
computational domain.

This code interoperability will also be exploited in the grid environment, by
dedicating one node to run FLY and another node to execute a fluid dynamics
code. This possibility will be considered in a new project to be developed at
the INAF of Catania in the next few years.